<!DOCTYPE html>
<html lang="en">
    <!--
      Licensed to the Apache Software Foundation (ASF) under one or more
      contributor license agreements.  See the NOTICE file distributed with
      this work for additional information regarding copyright ownership.
      The ASF licenses this file to You under the Apache License, Version 2.0
      (the "License"); you may not use this file except in compliance with
      the License.  You may obtain a copy of the License at
          http://www.apache.org/licenses/LICENSE-2.0
      Unless required by applicable law or agreed to in writing, software
      distributed under the License is distributed on an "AS IS" BASIS,
      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
      See the License for the specific language governing permissions and
      limitations under the License.
    -->
    <head>
        <meta charset="utf-8"/>
        <title>XMLReader</title>
        <link rel="stylesheet" href="../../../../../css/component-usage.css" type="text/css"/>
    </head>

    <body>
    <p>
        The XMLReader Controller Service reads XML content and creates Record objects. The Controller Service
        must be configured with a schema that describes the structure of the XML data. Fields in the XML data
        that are not defined in the schema will be skipped. Depending on whether the property "Expect Records as Array"
        is set to "false" or "true", the reader either expects a single record or an array of records for each FlowFile.
    </p>

    <p>
        Example: Single record
    </p>
    <code>
            <pre>
                &lt;record&gt;
                  &lt;field1&gt;content&lt;/field1&gt;
                  &lt;field2&gt;content&lt;/field2&gt;
                &lt;/record&gt;
            </pre>
    </code>

    <p>
        An array of records has to be enclosed by a root tag.
        Example: Array of records
    </p>

    <code>
            <pre>
                &lt;root&gt;
                  &lt;record&gt;
                    &lt;field1&gt;content&lt;/field1&gt;
                    &lt;field2&gt;content&lt;/field2&gt;
                  &lt;/record&gt;
                  &lt;record&gt;
                    &lt;field1&gt;content&lt;/field1&gt;
                    &lt;field2&gt;content&lt;/field2&gt;
                  &lt;/record&gt;
                &lt;/root&gt;
            </pre>
    </code>
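
    <p>
        The difference between the two layouts can be sketched in a few lines of Python. This is only an illustration
        under stated assumptions, not the reader's actual implementation: the helper <code>read_records</code> and its
        parameter <code>expect_records_as_array</code> are hypothetical names, and every field is treated as a plain string.
    </p>

```python
import xml.etree.ElementTree as ET

SINGLE = "<record><field1>a</field1><field2>b</field2></record>"
ARRAY = (
    "<root>"
    "<record><field1>a</field1><field2>b</field2></record>"
    "<record><field1>c</field1><field2>d</field2></record>"
    "</root>"
)


def read_records(xml_text, expect_records_as_array):
    """Hypothetical helper: return one dict per record element."""
    root = ET.fromstring(xml_text)
    # With an array, the root tag only wraps the records; otherwise the
    # root element is itself the single record.
    record_elements = list(root) if expect_records_as_array else [root]
    return [{child.tag: child.text for child in el} for el in record_elements]


single_records = read_records(SINGLE, expect_records_as_array=False)
array_records = read_records(ARRAY, expect_records_as_array=True)
```

    <p>
        In this sketch, <code>single_records</code> holds one record and <code>array_records</code> holds two.
    </p>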

    <h2>Example: Simple Fields</h2>

    <p>
        The simplest elements in XML data are tags (fields) that contain only content, with no attributes and no embedded tags.
        They can be described in the schema by simple types (e.g. INT, STRING).
    </p>

    <code>
            <pre>
                &lt;root&gt;
                  &lt;record&gt;
                    &lt;simple_field&gt;content&lt;/simple_field&gt;
                  &lt;/record&gt;
                &lt;/root&gt;
            </pre>
    </code>

    <p>
        This record can be described by a schema containing one field (e.g. of type string). Given this schema,
        the reader expects zero or one occurrence of "simple_field" in the record.
    </p>

    <code>
            <pre>
                {
                  "namespace": "nifi",
                  "name": "test",
                  "type": "record",
                  "fields": [
                    { "name": "simple_field", "type": "string" }
                  ]
                }
            </pre>
    </code>

    <h2>Example: Arrays with Simple Fields</h2>

    <p>
        Arrays are represented as repeated tags (fields) in XML data. In the following XML data, "array_field" is considered
        to be an array of simple fields, whereas "simple_field" is considered to be a simple field that is not part of
        an array.
    </p>

    <code>
            <pre>
                &lt;record&gt;
                  &lt;array_field&gt;content&lt;/array_field&gt;
                  &lt;array_field&gt;content&lt;/array_field&gt;
                  &lt;simple_field&gt;content&lt;/simple_field&gt;
                &lt;/record&gt;
            </pre>
    </code>

    <p>
        This record can be described by the following schema:
    </p>

    <code>
            <pre>
                {
                  "namespace": "nifi",
                  "name": "test",
                  "type": "record",
                  "fields": [
                    { "name": "array_field", "type":
                      { "type": "array", "items": "string" }
                    },
                    { "name": "simple_field", "type": "string" }
                  ]
                }
            </pre>
    </code>

    <p>
        If a field in a schema is defined as an array, the reader expects zero, one, or more occurrences of the field
        in a record. In principle, the field "array_field" could also be defined as a simple field, but then the second occurrence
        of this field would replace the first in the record object. Conversely, the field "simple_field" could also be defined
        as an array. In this case, the reader would put it into the record object as an array with one element.
    </p>
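
    <p>
        The three cases described above can be sketched as follows. This is a simplified Python illustration, not the
        reader's implementation: the helper <code>read_record</code> is hypothetical, and the schema is reduced to two
        sets of field names (an assumption made for brevity).
    </p>

```python
import xml.etree.ElementTree as ET

XML = """<record>
  <array_field>a</array_field>
  <array_field>b</array_field>
  <simple_field>c</simple_field>
</record>"""


def read_record(xml_text, array_fields, simple_fields):
    """Hypothetical helper: build a dict from the child tags of a record."""
    record = {}
    for child in ET.fromstring(xml_text):
        if child.tag in array_fields:
            # Array fields collect every occurrence of the tag.
            record.setdefault(child.tag, []).append(child.text)
        elif child.tag in simple_fields:
            # Simple fields keep one value; a later occurrence replaces it.
            record[child.tag] = child.text
        # Tags that are not in the schema are skipped.
    return record


as_declared = read_record(XML, {"array_field"}, {"simple_field"})
all_simple = read_record(XML, set(), {"array_field", "simple_field"})
all_arrays = read_record(XML, {"array_field", "simple_field"}, set())
```

    <p>
        Here <code>all_simple</code> keeps only the second "array_field" value, and <code>all_arrays</code> stores
        "simple_field" as a one-element list.
    </p>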

    <h2>Example: Tags with Attributes</h2>

    <p>
        XML fields frequently not only contain content, but also attributes. The following record contains a field with
        an attribute "attr" and content:
    </p>

    <code>
            <pre>
                &lt;record&gt;
                  &lt;field_with_attribute attr="attr_content"&gt;content of field&lt;/field_with_attribute&gt;
                &lt;/record&gt;
            </pre>
    </code>

    <p>
        To parse the content of the field "field_with_attribute" together with the attribute "attr", the first two
        requirements below have to be fulfilled; the third setting is optional:
    </p>

    <ul>
        <li>In the schema, the field has to be defined as record.</li>
        <li>The property "Field Name for Content" has to be set.</li>
        <li>As an option, the property "Attribute Prefix" also can be set.</li>
    </ul>

    <p>
        For the example above, the following property settings are assumed:
    </p>

    <table>
        <tr>
            <th>Property Name</th>
            <th>Property Value</th>
        </tr>
        <tr>
            <td>Field Name for Content</td>
            <td><code>field_name_for_content</code></td>
        </tr>
        <tr>
            <td>Attribute Prefix</td>
            <td><code>prefix_</code></td>
        </tr>
    </table>

    <p>
        The schema can be defined as follows:
    </p>

    <code>
            <pre>
                {
                  "name": "test",
                  "namespace": "nifi",
                  "type": "record",
                  "fields": [
                    {
                      "name": "field_with_attribute",
                      "type": {
                        "name": "RecordForTag",
                        "type": "record",
                        "fields" : [
                          {"name": "attr", "type": "string"},
                          {"name": "field_name_for_content", "type": "string"}
                        ]
                      }
                    }
                  ]
                }
            </pre>
    </code>

    <p>
        Note that the field "field_name_for_content" has to be defined both in the property section and in the
        schema, whereas the prefix for attributes is not part of the schema. The prefix is prepended to the attribute name
        when an attribute named "attr" is found at the respective position in the XML data and added to the record. The record
        object of the above example will be structured as follows:
    </p>

    <code>
            <pre>
                Record (
                    Record "field_with_attribute" (
                        RecordField "prefix_attr" = "attr_content",
                        RecordField "field_name_for_content" = "content of field"
                    )
                )
            </pre>
    </code>

    <p>
        In principle, the field "field_with_attribute" could also be defined as a simple field. In this case, the attributes
        would simply be ignored. Conversely, the simple field in example 1 above could also be defined as a record (assuming that
        the property "Field Name for Content" is set).
    </p>
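
    <p>
        How the two properties shape the resulting record can be sketched in Python. This is a minimal illustration
        rather than the reader's code: the helper <code>read_tag_as_record</code> and its parameter names are hypothetical,
        and schema checks are omitted.
    </p>

```python
import xml.etree.ElementTree as ET

XML = (
    "<record>"
    '<field_with_attribute attr="attr_content">content of field</field_with_attribute>'
    "</record>"
)


def read_tag_as_record(element, content_field_name, attribute_prefix=""):
    """Hypothetical helper: turn one tag with attributes and content into a dict."""
    record = {}
    for name, value in element.attrib.items():
        # The configured prefix is put in front of each attribute name.
        record[attribute_prefix + name] = value
    if element.text and element.text.strip():
        # The tag content is stored under the configured content field name.
        record[content_field_name] = element.text.strip()
    return record


tag = ET.fromstring(XML).find("field_with_attribute")
with_prefix = read_tag_as_record(tag, "field_name_for_content", "prefix_")
without_prefix = read_tag_as_record(tag, "field_name_for_content")
```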

    <h2>Example: Tags within tags</h2>

    <p>
        XML data is frequently nested. In this case, tags enclose other tags:
    </p>

    <code>
            <pre>
                &lt;record&gt;
                  &lt;field_with_embedded_fields attr=&quot;attr_content&quot;&gt;
                    &lt;embedded_field&gt;embedded content&lt;/embedded_field&gt;
                    &lt;another_embedded_field&gt;another embedded content&lt;/another_embedded_field&gt;
                  &lt;/field_with_embedded_fields&gt;
                &lt;/record&gt;
            </pre>
    </code>

    <p>
        The enclosing fields always have to be defined as records, irrespective of whether they include attributes to be
        parsed or not. In this example, the tag "field_with_embedded_fields" encloses the fields "embedded_field" and
        "another_embedded_field", which are both simple fields. The schema can be defined as follows:
    </p>

    <code>
            <pre>
                {
                  "name": "test",
                  "namespace": "nifi",
                  "type": "record",
                  "fields": [
                    {
                      "name": "field_with_embedded_fields",
                      "type": {
                        "name": "RecordForEmbedded",
                        "type": "record",
                        "fields" : [
                          {"name": "attr", "type": "string"},
                          {"name": "embedded_field", "type": "string"},
                          {"name": "another_embedded_field", "type": "string"}
                        ]
                      }
                    }
                  ]
                }
            </pre>
    </code>

    <p>
        Notice that this case does not require the property "Field Name for Content" to be set as this is only required
        for tags containing attributes and content.
    </p>

    <h2>Example: Array of records</h2>

    <p>
        To further illustrate the logic of this reader, consider an array of records.
        The following record contains the field "array_field", which occurs repeatedly. The field contains two
        embedded fields.
    </p>

    <code>
            <pre>
                &lt;record&gt;
                  &lt;array_field&gt;
                    &lt;embedded_field&gt;embedded content 1&lt;/embedded_field&gt;
                    &lt;another_embedded_field&gt;another embedded content 1&lt;/another_embedded_field&gt;
                  &lt;/array_field&gt;
                  &lt;array_field&gt;
                    &lt;embedded_field&gt;embedded content 2&lt;/embedded_field&gt;
                    &lt;another_embedded_field&gt;another embedded content 2&lt;/another_embedded_field&gt;
                  &lt;/array_field&gt;
                &lt;/record&gt;
            </pre>
    </code>

    <p>
        This XML data can be parsed similarly to the data in example 4. However, the record defined in the schema of
        example 4 has to be embedded in an array.
    </p>

    <code>
            <pre>
                {
                  "namespace": "nifi",
                  "name": "test",
                  "type": "record",
                  "fields": [
                    { "name": "array_field",
                      "type": {
                        "type": "array",
                        "items": {
                          "name": "RecordInArray",
                          "type": "record",
                          "fields" : [
                            {"name": "embedded_field", "type": "string"},
                            {"name": "another_embedded_field", "type": "string"}
                          ]
                        }
                      }
                    }
                  ]
                }
            </pre>
    </code>

    <h2>Example: Array in record</h2>

    <p>
        In XML data, arrays are frequently enclosed by tags:
    </p>

    <code>
            <pre>
                &lt;record&gt;
                  &lt;field_enclosing_array&gt;
                    &lt;element&gt;content 1&lt;/element&gt;
                    &lt;element&gt;content 2&lt;/element&gt;
                  &lt;/field_enclosing_array&gt;
                  &lt;field_without_array&gt;content 3&lt;/field_without_array&gt;
                &lt;/record&gt;
            </pre>
    </code>

    <p>
        For the schema, embedded tags have to be described by records. Therefore, the field "field_enclosing_array"
        is a record that embeds an array with elements of type string:
    </p>

    <code>
            <pre>
                {
                  "namespace": "nifi",
                  "name": "test",
                  "type": "record",
                  "fields": [
                    { "name": "field_enclosing_array",
                      "type": {
                        "name": "EmbeddedRecord",
                        "type": "record",
                        "fields" : [
                          {
                            "name": "element",
                            "type": {
                              "type": "array",
                              "items": "string"
                            }
                          }
                        ]
                      }
                    },
                    { "name": "field_without_array", "type": "string" }
                  ]
                }
            </pre>
    </code>
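
    <p>
        The way nested records and arrays combine can be sketched with a small recursive, schema-driven walk in Python.
        This is an illustration under simplifying assumptions, not the reader's implementation: the helpers
        <code>read_value</code> and <code>read_record</code> are hypothetical, the schema is written as a Python dict that
        mirrors the one above, and only string, record, and array types are handled.
    </p>

```python
import xml.etree.ElementTree as ET

SCHEMA = {
    "type": "record",
    "fields": [
        {"name": "field_enclosing_array", "type": {
            "type": "record",
            "fields": [
                {"name": "element", "type": {"type": "array", "items": "string"}},
            ],
        }},
        {"name": "field_without_array", "type": "string"},
    ],
}

XML = """<record>
  <field_enclosing_array>
    <element>content 1</element>
    <element>content 2</element>
  </field_enclosing_array>
  <field_without_array>content 3</field_without_array>
</record>"""


def read_value(elements, field_type):
    """Convert sibling elements that share one tag name, according to field_type."""
    kind = field_type["type"] if isinstance(field_type, dict) else field_type
    if kind == "array":
        # Every occurrence of the tag becomes one array item.
        return [read_value([el], field_type["items"]) for el in elements]
    last = elements[-1]  # for non-array fields the last occurrence wins
    if kind == "record":
        return read_record(last, field_type)
    return last.text


def read_record(element, record_schema):
    """Read the direct children of element that are named in the schema."""
    record = {}
    for field in record_schema["fields"]:
        matches = element.findall(field["name"])  # direct children only
        if matches:
            record[field["name"]] = read_value(matches, field["type"])
    return record


parsed = read_record(ET.fromstring(XML), SCHEMA)
```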


    <h2>Example: Maps</h2>

    <p>
        A map is a field embedding fields with different names:
    </p>

    <code>
            <pre>
                &lt;record&gt;
                  &lt;map_field&gt;
                    &lt;field1&gt;content&lt;/field1&gt;
                    &lt;field2&gt;content&lt;/field2&gt;
                    ...
                  &lt;/map_field&gt;
                  &lt;simple_field&gt;content&lt;/simple_field&gt;
                &lt;/record&gt;
            </pre>
    </code>

    <p>
        This data can be processed using the following schema:
    </p>

    <code>
            <pre>
                {
                  "namespace": "nifi",
                  "name": "test",
                  "type": "record",
                  "fields": [
                    { "name": "map_field", "type":
                      { "type": "map", "values": "string" }
                    },
                    { "name": "simple_field", "type": "string" }
                  ]
                }
            </pre>
    </code>


    <h2>Schema Inference</h2>

    <p>
        While NiFi's Record API does require that each Record have a schema, it is often convenient to infer the schema based on the values in the data,
        rather than having to manually create a schema. This is accomplished by selecting a value of "Infer Schema" for the "Schema Access Strategy" property.
        When using this strategy, the Reader will determine the schema by first parsing all data in the FlowFile, keeping track of all fields that it has encountered
        and the type of each field. Once all data has been parsed, a schema is formed that encompasses all fields that have been encountered.
    </p>

    <p>
        A common concern when inferring schemas is how to handle the condition of two values that have different types. For example, consider a FlowFile with the following two records:
    </p>
    <code><pre>
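&lt;root&gt;
    &lt;record&gt;
        &lt;name&gt;John&lt;/name&gt;
        &lt;age&gt;8&lt;/age&gt;
        &lt;values&gt;N/A&lt;/values&gt;
    &lt;/record&gt;
    &lt;record&gt;
        &lt;name&gt;Jane&lt;/name&gt;
        &lt;age&gt;Ten&lt;/age&gt;
        &lt;values&gt;8&lt;/values&gt;
        &lt;values&gt;Ten&lt;/values&gt;
    &lt;/record&gt;
&lt;/root&gt;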
</pre></code>

    <p>
        It is clear that the "name" field will be inferred as a STRING type. However, how should we handle the "age" field? Should the field be a CHOICE between INT and STRING? Should we
        prefer LONG over INT? Should we just use a STRING? Should the field be considered nullable?
    </p>

    <p>
        To help understand how this Record Reader infers schemas, we have the following list of rules that are followed in the inference logic:
    </p>

    <ul>
        <li>All fields are inferred to be nullable.</li>
        <li>
            When two values are encountered for the same field in two different records (or two values are encountered for an ARRAY type), the inference engine prefers
            to use a "wider" data type over using a CHOICE data type. A data type "A" is said to be wider than data type "B" if and only if data type "A" encompasses all
            values of "B" in addition to other values. For example, the LONG type is wider than the INT type but not wider than the BOOLEAN type (and BOOLEAN is also not wider
            than LONG). INT is wider than SHORT. The STRING type is considered wider than all other types with the exception of MAP, RECORD, ARRAY, and CHOICE.
        </li>
        <li>
            If two values are encountered for the same field in two different records (or two values are encountered for an ARRAY type), but neither value is of a type that
            is wider than the other, then a CHOICE type is used. In the example above, the "values" field will be inferred as a CHOICE between a STRING or an ARRAY&lt;STRING&gt;.
        </li>
        <li>
            If the "Time Format," "Timestamp Format," or "Date Format" properties are configured, any value that would otherwise be considered a STRING type is first checked against
            the configured formats to see if it matches any of them. If the value matches the Timestamp Format, the value is considered a Timestamp field. If it matches the Date Format,
            it is considered a Date field. If it matches the Time Format, it is considered a Time field. In the unlikely event that the value matches more than one of the configured
            formats, they will be matched in the order: Timestamp, Date, Time. I.e., if a value matched both the Timestamp Format and the Date Format, the type that is inferred will be
            Timestamp. Because parsing dates and times can be expensive, it is advisable not to configure these formats if dates, times, and timestamps are not expected, or if processing
            the data as a STRING is acceptable. For use cases when this is important, though, the inference engine is intelligent enough to optimize the parsing by first checking several
            very cheap conditions. For example, the string's length is examined to see if it is too long or too short to match the pattern. This results in far more efficient processing
            than would result if attempting to parse each string value as a timestamp.
        </li>
        <li>The MAP type is never inferred. Instead, the RECORD type is used.</li>
        <li>If two elements exist with the same name and the same parent (i.e., two sibling elements have the same name), the field will be inferred to be of type ARRAY.</li>
        <li>If a field exists but all values are null, then the field is inferred to be of type STRING.</li>
    </ul>
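
    <p>
        The "wider type" and CHOICE rules above can be sketched in Python. This is a simplified model, not the inference
        engine itself: the function names <code>is_wider</code> and <code>merge_types</code> are hypothetical, types are
        represented as plain strings, and only the relationships stated in the list (SHORT, INT, LONG, STRING, and the
        complex types) are modeled.
    </p>

```python
# Integral types listed from narrowest to widest, as stated in the rules above.
INTEGRAL_ORDER = ["SHORT", "INT", "LONG"]
# Types that STRING is not considered wider than.
NOT_COVERED_BY_STRING = {"MAP", "RECORD", "ARRAY", "CHOICE"}


def is_wider(a, b):
    """Return True if type a is wider than type b in this simplified model."""
    if a == b:
        return False
    if a == "STRING":
        return b not in NOT_COVERED_BY_STRING
    if a in INTEGRAL_ORDER and b in INTEGRAL_ORDER:
        return INTEGRAL_ORDER.index(a) > INTEGRAL_ORDER.index(b)
    return False


def merge_types(a, b):
    """Combine the types of two values seen for the same field."""
    if a == b:
        return a
    if is_wider(a, b):
        return a
    if is_wider(b, a):
        return b
    # Neither type is wider than the other, so fall back to a CHOICE.
    return ("CHOICE", tuple(sorted((a, b))))


age_type = merge_types("INT", "STRING")       # "8" and "Ten"
values_type = merge_types("STRING", "ARRAY")  # single value vs. repeated tag
```

    <p>
        In this model, the "age" field from the example resolves to STRING, while the "values" field resolves to a CHOICE
        because STRING is not wider than ARRAY.
    </p>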



    <h2>Caching of Inferred Schemas</h2>

    <p>
        This Record Reader requires that, if a schema is to be inferred, all records be read in order to ensure that the inferred schema is applicable to all
        records in the FlowFile. However, this can become expensive, especially if the data undergoes many different transformations. To alleviate the cost of inferring schemas,
        the Record Reader can be configured with a "Schema Inference Cache" by populating the property with that name. This is a Controller Service that can be shared by Record
        Readers and Record Writers.
    </p>

    <p>
        Whenever a Record Writer is used to write data, if it is configured with a "Schema Cache," it will also add the schema to the Schema Cache. This will result in an
        identifier for that schema being added as an attribute to the FlowFile.
    </p>

    <p>
        Whenever a Record Reader is used to read data, if it is configured with a "Schema Inference Cache", it will first look for a "schema.cache.identifier" attribute on the FlowFile.
        If the attribute exists, it will use the value of that attribute to look up the schema in the schema cache. If it is able to find a schema in the cache with that identifier,
        then it will use that schema instead of reading, parsing, and analyzing the data to infer the schema. If the attribute is not available on the FlowFile, or if the attribute is
        available but the cache does not have a schema with that identifier, then the Record Reader will proceed to infer the schema as described above.
    </p>
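
    <p>
        The lookup order described above can be sketched in Python. This is only an illustration: the function
        <code>resolve_schema</code> and its parameters are hypothetical, the cache is modeled as a plain dict, and
        <code>infer_schema</code> stands in for the full inference pass. Only the attribute name
        "schema.cache.identifier" is taken from the description above.
    </p>

```python
def resolve_schema(flowfile_attributes, schema_cache, infer_schema):
    """Return a cached schema if one can be found, otherwise infer one."""
    identifier = flowfile_attributes.get("schema.cache.identifier")
    if identifier is not None:
        cached = schema_cache.get(identifier)
        if cached is not None:
            # Cache hit: the data does not need to be parsed for inference.
            return cached
    # Attribute missing, or no schema cached under that identifier.
    return infer_schema()


cache = {"abc": {"fields": ["name", "age"]}}
inferred = {"fields": ["name"]}

hit = resolve_schema({"schema.cache.identifier": "abc"}, cache, lambda: inferred)
no_attribute = resolve_schema({}, cache, lambda: inferred)
unknown_id = resolve_schema({"schema.cache.identifier": "zzz"}, cache, lambda: inferred)
```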

    <p>
        The end result is that users are able to chain together many different Processors to operate on Record-oriented data. Typically, only the first such Processor in the chain will
        incur the "penalty" of inferring the schema. For all other Processors in the chain, the Record Reader is able to simply look up the schema in the Schema Cache by identifier.
        This allows the Record Reader to infer a schema accurately, since it is inferred based on all data in the FlowFile, and still allows this to happen efficiently since the schema
        will typically only be inferred once, regardless of how many Processors handle the data.
    </p>



    </body>
</html>
